This project uses R and exploratory data analysis (EDA) techniques to explore a dataset named wineRedQuality. The dataset comes from P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, Modeling wine preferences by data mining from physicochemical properties, Decision Support Systems, Elsevier, 47(4):547-553, ISSN: 0167-9236.
Let’s see which variables are included in this dataset.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
Let us now check the variable types:
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
There are 1599 observations and 13 variables (including X). X appears to be an index value for each observation. Also, quality is stored as an integer, while all other variables are numeric.
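The listings above are the kind of output that `names()` and `str()` print. Here is a minimal sketch of that step, assuming the data is stored as a CSV file. The file name `wineQualityReds.csv` and the helper `inspect_wine` are assumptions made for illustration; only the object name `wine_data` appears in the output of this report.

```r
# Hypothetical file name; point this at wherever the CSV actually lives.
wine_path <- "wineQualityReds.csv"

# Read the CSV and report its structure. The data frame is returned
# invisibly so the call can be assigned without printing it again.
inspect_wine <- function(path) {
  data <- read.csv(path)
  print(names(data))  # variable names
  str(data)           # variable types and the first few values
  invisible(data)
}

# Only attempt the read when the (assumed) file is actually present.
if (file.exists(wine_path)) {
  wine_data <- inspect_wine(wine_path)
}
```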
Let us drop the X variable, which is used only for indexing:
Checking the variables after deleting X:
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
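Dropping the index column could look like the sketch below. It uses the first three rows from the `str()` listing and only two of the measured columns, purely so that it runs on its own; on the real data the same assignment would be applied to the full `wine_data`.

```r
# First three rows from the str() listing, enough to show the step.
wine_data <- data.frame(
  X = 1:3,
  fixed.acidity = c(7.4, 7.8, 7.8),
  quality = c(5L, 5L, 5L)
)

# Setting a column to NULL removes it from the data frame.
wine_data$X <- NULL

names(wine_data)
```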
Let us check the distribution of each variable, starting with summary statistics:
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
After checking the ratings and variable distribution, I’ll create another categorical variable, classifying the wines as ‘bad’ (rating 0 to 4), ‘average’ (rating 5 or 6), and ‘good’ (rating 7 to 10).
## bad average good
## 63 1319 217
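One way to build such a grouping is `cut()`, sketched below. The helper name `rate_wine` is hypothetical, and the report may have built the variable differently (for example with `ifelse()`); the break points follow the rating ranges described above.

```r
# Map integer quality scores onto the three rating groups described above:
# 0-4 -> bad, 5-6 -> average, 7-10 -> good.
rate_wine <- function(quality) {
  cut(quality,
      breaks = c(0, 4, 6, 10),
      labels = c("bad", "average", "good"),
      include.lowest = TRUE)  # so a score of 0 still falls in "bad"
}

# Quality scores of the first ten wines from the str() listing.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
table(rate_wine(quality))
```

On the full data, `wine_data$rating <- rate_wine(wine_data$quality)` followed by `table(wine_data$rating)` would give the kind of count table shown above.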
Now we will visualize the distribution of each variable by plotting a histogram for each one:
## Using rating as id variables
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The plots above show that most of the variables are close to a normal distribution, except “chlorides” and “residual sugar”. This seems to be due to outliers, which we can exclude before replotting the histograms.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 79 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 80 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
After excluding the outliers, the distributions of residual sugar and chlorides also look normal.
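The warnings above suggest the original plots simply limited the x-axis, which drops the out-of-range rows. As a swapped-in alternative, the sketch below filters the values directly at the 95th percentile before plotting. The helper `trim_upper` is hypothetical and uses only base R.

```r
# Keep only values at or below the given quantile (95th percentile by default).
trim_upper <- function(x, prob = 0.95) {
  cutoff <- quantile(x, probs = prob, na.rm = TRUE)
  x[!is.na(x) & x <= cutoff]
}

# Residual sugar of the first ten wines from the str() listing;
# the value 6.1 sits far above the rest.
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)
trimmed <- trim_upper(residual.sugar)
trimmed
```

Passing `trimmed` to `hist()`, or to whichever plotting call is in use, then redraws the histogram without the long right tail.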
The dataset has 1599 observations and 13 variables (including X). X appears to be an index value for each observation. Also, quality is stored as an integer, while all other variables are numeric.
I am interested in the quality ratings of red wine and in which variables influence those ratings.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
As the plot above shows, most of the red wines in the dataset have quality ratings of 5 or 6.
For wines, the first two factors from this dataset that come to mind are alcohol content and density. It would therefore be interesting to analyse the relationship between alcohol content and wine density. Let’s have a look at these two variables:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The graph is right-skewed, with density peaking where alcohol content is between 9% and 10%. We can see that the density decreases as the alcohol content increases.
I created another variable, “rating”, to categorize wine quality into three groups (good, average, bad) and give a clear, summarized view of wine quality in graphs and figures.
Most of the variables are more or less normally distributed, except “chlorides” and “residual sugar”. This was due to outliers, which were handled by excluding values above the 95th percentile of these two variables and replotting the histograms.
To answer the above, let’s create a scatterplot matrix to check the correlations between these variables:
## Warning in ggscatmat(wine_data): Factor variables are omitted in plot
Here are some interesting correlations derived from the scatterplot matrix above:
The following variables seem to have the highest positive correlations with wine quality:
Here are the variables with the strongest negative correlations with wine quality:
So we observe that volatile acidity is negatively correlated with the quality of red wine.
Here are the highest positive and negative correlations shown in the scatterplot matrix:
alcohol:quality = 0.48
density:alcohol = -0.50
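Coefficients like the two above can also be read straight from `cor()`. The sketch below ranks variables by their Pearson correlation with quality. The helper `quality_correlations` is hypothetical, and the sample is only the first ten wines from the `str()` listing with a subset of columns, so its numbers will not match the full-dataset values quoted above.

```r
# First ten wines from the str() listing (a subset of columns), used only
# so the sketch runs on its own.
wine_sample <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation of every numeric column with `quality`,
# sorted from most negative to most positive.
quality_correlations <- function(data) {
  numeric_cols <- data[sapply(data, is.numeric)]
  corr <- cor(numeric_cols)[, "quality"]
  sort(corr[names(corr) != "quality"])
}

round(quality_correlations(wine_sample), 2)
```

Because factor columns such as `rating` are skipped by the `is.numeric` filter, the same call should work on the full `wine_data` as well.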
From the correlation matrix created above, I think it would be interesting to analyze the interaction between the following variables. Let’s see how some of the important variables compare when plotted against each other for bad, average and good quality wines:
Interestingly, the plots above show that density and fixed acidity are positively correlated with each other, even though fixed acidity is positively correlated with wine quality and density is negatively correlated with it.
The strongest relationship appears between wine quality and alcohol content. Wines with higher alcohol content appear to have higher quality ratings.
Let’s create plots to check some of the strong interactions between variables that, based on the correlation matrix, I think would be interesting to see for “good”, “average” and “bad” wines:
From the graphs above, I am choosing density, alcohol and sulphates and examining their influence on wine quality in more detail with the following plots in this section.
As seen above, good quality wines have lower density and more alcohol. The graph above shows the kernel density: geom_density computes and draws a kernel density estimate, which is a smoothed version of the histogram.
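To make the idea of a kernel density estimate concrete, here is a small sketch that uses base R's `density()` as a stand-in for `geom_density`, computing one estimate per rating group. The helper `density_by_group` is hypothetical, and the sample is the first ten wines from the `str()` listing.

```r
# Kernel density estimate of `x` within each level of `group`.
# Groups with fewer than two values are skipped, since a bandwidth
# cannot be chosen automatically for them.
density_by_group <- function(x, group) {
  parts <- split(x, group)
  parts <- parts[lengths(parts) >= 2]
  lapply(parts, density)
}

# Alcohol and quality of the first ten wines from the str() listing.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
rating <- cut(quality, breaks = c(0, 4, 6, 10),
              labels = c("bad", "average", "good"), include.lowest = TRUE)

estimates <- density_by_group(alcohol, rating)
names(estimates)  # groups that had enough data for an estimate
```

Each element is a smooth curve (`x`, `y`) whose area is approximately one, which is why curves for groups of very different sizes can be compared on the same axes.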
Let’s create scatter plots to check the influence of density and alcohol on wine quality:
The graph above shows that good quality wines have more sulphates and lower density. Here is the scatter plot to check the influence of density and alcohol on wine quality:
The graph above shows that good quality wines have lower density and more sulphates than bad and average quality wines.
Thus, from all the observations above, it appears that “alcohol” and “sulphates” are positively correlated with wine quality, while “density” is negatively correlated with quality and is lower in good quality wines.
In our final plots, let’s check how all the acids in the dataset influence wine quality:
## [1] "Median of fixed.acidity by quality:"
## wine_data$quality: 3
## [1] 7.5
## --------------------------------------------------------
## wine_data$quality: 4
## [1] 7.5
## --------------------------------------------------------
## wine_data$quality: 5
## [1] 7.8
## --------------------------------------------------------
## wine_data$quality: 6
## [1] 7.9
## --------------------------------------------------------
## wine_data$quality: 7
## [1] 8.8
## --------------------------------------------------------
## wine_data$quality: 8
## [1] 8.25
We can see an increase in fixed acidity from the average quality rating (6) to the high quality rating (7). There is also a large dispersion of fixed acidity values across the scale, which indicates that fixed acidity cannot be the only factor behind good quality wine and that quality depends on other factors too.
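The per-quality medians listed above have the layout that `by(wine_data$fixed.acidity, wine_data$quality, median)` prints. A more compact sketch of the same calculation with `tapply()` follows, run on the first ten wines from the `str()` listing so that it is self-contained.

```r
# First ten wines from the str() listing.
fixed.acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Median of fixed.acidity within each quality score;
# the result is named by quality score.
medians <- tapply(fixed.acidity, quality, median)
medians
```

The same pattern applies to citric.acid, volatile.acidity and pH by swapping the first argument.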
## [1] "Median of citric.acid by quality:"
## wine_data$quality: 3
## [1] 0.035
## --------------------------------------------------------
## wine_data$quality: 4
## [1] 0.09
## --------------------------------------------------------
## wine_data$quality: 5
## [1] 0.23
## --------------------------------------------------------
## wine_data$quality: 6
## [1] 0.26
## --------------------------------------------------------
## wine_data$quality: 7
## [1] 0.4
## --------------------------------------------------------
## wine_data$quality: 8
## [1] 0.42
We see that citric acid is higher for good quality ratings, which suggests that the higher the citric acid, the higher the quality of the wine, though of course the quantity should be measured and not excessive.
## [1] "Median of volatile.acidity by quality:"
## wine_data$quality: 3
## [1] 0.845
## --------------------------------------------------------
## wine_data$quality: 4
## [1] 0.67
## --------------------------------------------------------
## wine_data$quality: 5
## [1] 0.58
## --------------------------------------------------------
## wine_data$quality: 6
## [1] 0.49
## --------------------------------------------------------
## wine_data$quality: 7
## [1] 0.37
## --------------------------------------------------------
## wine_data$quality: 8
## [1] 0.37
Lower volatile acidity seems to mean higher wine quality, as reflected in the correlation matrix, i.e. volatile.acidity:quality = -0.39.
## [1] "Median of pH by quality:"
## wine_data$quality: 3
## [1] 3.39
## --------------------------------------------------------
## wine_data$quality: 4
## [1] 3.37
## --------------------------------------------------------
## wine_data$quality: 5
## [1] 3.3
## --------------------------------------------------------
## wine_data$quality: 6
## [1] 3.32
## --------------------------------------------------------
## wine_data$quality: 7
## [1] 3.28
## --------------------------------------------------------
## wine_data$quality: 8
## [1] 3.23
The graph above shows that pH tends to be lower for higher quality wines.
From the plots above we can conclude that:
We will create plots to check how alcohol and sulphates relate to the given quality ratings.
## wine_data$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4950 0.5600 0.5922 0.6000 2.0000
## --------------------------------------------------------
## wine_data$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3700 0.5400 0.6100 0.6473 0.7000 1.9800
## --------------------------------------------------------
## wine_data$rating: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7435 0.8200 1.3600
## wine_data$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.60 10.00 10.22 11.00 13.10
## --------------------------------------------------------
## wine_data$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.00 10.25 10.90 14.90
## --------------------------------------------------------
## wine_data$rating: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.60 11.52 12.20 14.00
As seen above, good wines have a mean alcohol content of about 11.52 (% by volume) and mean sulphates of about 0.7435 g/dm3 (potassium sulphate). These plots show that higher alcohol and sulphate levels are associated with better wines.
These boxplots show the effect of alcohol content on wine quality. Higher alcohol content is correlated with higher wine quality. It is also worth noticing, however, that alcohol content alone did not produce higher quality, as shown by the outliers and intervals.
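Group means like those quoted above can be collected into a single table with `aggregate()`. This is a sketch on the first ten wines from the `str()` listing, so its values differ from the full-dataset means; the `rating` column is rebuilt here with `cut()` only as an assumption about how it was defined.

```r
# First ten wines from the str() listing.
wine_sample <- data.frame(
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
wine_sample$rating <- cut(wine_sample$quality, breaks = c(0, 4, 6, 10),
                          labels = c("bad", "average", "good"),
                          include.lowest = TRUE)

# Mean alcohol and sulphates per rating group; groups with no rows
# (here "bad") do not appear in the result.
group_means <- aggregate(cbind(alcohol, sulphates) ~ rating,
                         data = wine_sample, FUN = mean)
group_means
```

Replacing `FUN = mean` with `FUN = median` gives a summary that is less sensitive to the outliers seen in the sulphates maxima.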
This was an interesting dataset to explore using R and EDA techniques. I focused on finding which variables determine the quality of red wine. I checked the dataset and removed some outliers found in the histograms of a couple of variables to get more precise results. I chose variables based on their correlation coefficients to draw out the relations between them and to determine their influence on wine quality.
After all the analysis, it can be concluded that the major factors determining wine quality are alcohol, acidity and sulphates. Wine quality is positively correlated with alcohol, sulphates and acids (except volatile acids), so good quality wines are rich in these. There is a negative correlation between pH and wine quality. Sulfur dioxide and residual sugar do not seem to have much impact on the quality of the wines.
It was interesting to see that fixed acidity and density are positively correlated with each other, even though fixed acidity is positively correlated with wine quality and density is negatively correlated with it. It would be even more interesting to add other factors, such as aging and wine brands, in a future analysis.
I struggled to choose the most appropriate graphs and to decide which variables were strongest to compare with each other in a given context. I used the correlation matrix to write out a list of variable comparisons and applicable graphs at my disposal, and determined the strengths and weaknesses of each. This made it easy for me to choose different plots for the various factor combinations.